Annotating the Legacy Web with Lixto
Abstract
Introduction. The Semantic Web is still a vision. Today's unstructured Web contains millions of documents that cannot be queried and in which layout and structure are heavily mixed. Moreover, these documents are not annotated at all. There is a huge gap between Web information and the qualified, structured data required by corporate information systems. According to the vision of the Semantic Web, all information available on the Web will eventually be suitably structured, annotated, and qualified. Until this goal is reached, and as a way of reaching it faster, relevant data can be (semi-)automatically extracted from HTML documents and automatically translated into a structured format such as XML. A program that automatically extracts data and transforms it into another format (that is, marks up the content with semantic information) is called a wrapper. Intelligent content extraction provides the foundation for the automatic generation of semantic markup.
Various approaches to automatic content extraction have been proposed, ranging from machine learning to pattern recognition techniques. However, these approaches generally fail to produce useful results because of the complexity of Web pages. Other approaches rely on manually editing script files that wrap the relevant data from Web pages into more structured formats. This process is time-consuming and hard for non-technical wrapper designers to understand, and the resulting script files are not easy to maintain. We propose another approach: a supervised and visual definition of content extraction. Based on interactively identifying and extracting relevant parts of HTML documents and translating the content into XML, we designed and implemented Lixto Visual Wrapper [2], an efficient wrapper generation tool that is well suited to building HTML-to-XML wrappers. Once built, such a wrapper can be applied repeatedly to extract relevant information from the class of Web pages it was designed for.
The process of wrapping consists of two steps. The first is the identification phase, in which relevant fragments of Web pages are extracted. The extraction rules for this phase are specified semi-automatically by a wrapper designer (e.g., in a purely visual way in Lixto Visual Wrapper). It is followed by the structuring phase, in which the extracted data is mapped to some destination format, e.g., by enriching it with XML tags. With respect to the Semantic Web, a third phase is required: each information unit needs to be put into relation with other pieces of information. Additionally, methods for automatically relating extracted information to existing domain ontologies are required. Integrating the extracted data plays an important role in querying the Web: data schemas, metadata, and background knowledge such as ontologies must be integrated to reach a uniform semantic interpretation of information. Lixto Transformation Server [6] covers the integration, annotation, and delivery process as part of the Lixto set of tools. As long as the Web is not a source of machine-readable data but remains accessible mainly to human beings, wrapper technology gives computers a way to query and interact with it [3].
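The two phases described above can be illustrated with a minimal hand-written sketch. This is not how Lixto works internally (Lixto wrappers are specified visually, not by editing code); it only shows an identification step followed by a structuring step using Python's standard library. The sample HTML, the class names "offer", "title", and "price", and the function names identify, structure, and wrap are all hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample page; the markup and class names are invented for this sketch.
SAMPLE_HTML = """
<html><body>
<table>
  <tr class="offer"><td class="title">Used laptop</td><td class="price">EUR 250</td></tr>
  <tr class="offer"><td class="title">Desk lamp</td><td class="price">EUR 15</td></tr>
</table>
</body></html>
"""


class OfferExtractor(HTMLParser):
    """Identification phase: collect the text of <td> cells inside <tr class="offer">."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per identified table row
        self._current = None   # record currently being filled, if any
        self._field = None     # class attribute of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "offer":
            self._current = {}
        elif tag == "td" and self._current is not None:
            self._field = attrs.get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a classified cell of an offer row.
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def identify(html):
    """Identification phase: return the relevant fragments as a list of dicts."""
    parser = OfferExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


def structure(records):
    """Structuring phase: map the extracted records to XML elements."""
    root = ET.Element("offers")
    for record in records:
        offer = ET.SubElement(root, "offer")
        for name, value in record.items():
            ET.SubElement(offer, name).text = value
    return ET.tostring(root, encoding="unicode")


def wrap(html):
    """A wrapper in the sense used above: identification, then structuring."""
    return structure(identify(html))


if __name__ == "__main__":
    print(wrap(SAMPLE_HTML))
```

Keeping the two phases in separate functions mirrors the split in the text: the extraction rules can change when the page layout changes, while the mapping to the destination format stays the same. The third phase (relating information units to each other and to domain ontologies) is not covered by this sketch.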
Similar resources
Web Information Acquisition with Lixto Suite: A Demonstration
We demonstrate the Lixto Suite, a web data extraction and transformation software kit for retrieving and converting information from various sources to various customer devices. With the Lixto Suite, non-technical content managers can rapidly develop applications in the areas of M-Commerce, E-Commerce, content integration and corporate portals.
Visual Web Information Extraction with Lixto
We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user to semi-automatically create wrapper programs by providing a fully visual ...
Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto
Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML. This paper describes some advanced features of Lixto, such as disjunctive pattern definitions, specialization rules, and Lixto’...
Lixto – Price Intelligence Suite
IMPACT The Lixto Price Intelligence Suite is a solution that extracts price-comparison information from competitor online web channels and combines it with internal data sources so organizations gain greater visibility into the factors that might influence price. The suite works by navigating and extracting competitive product and pricing information from predefined online data sources before s...
The Lixto Project: Exploring New Frontiers of Web Data Extraction
The Lixto project is an ongoing research effort in the area of Web data extraction. Whereas the project originally started out with the idea to develop a logic-based extraction language and a tool to visually define extraction programs from sample Web pages, the scope of the project has been extended over time. Today, new issues such as employing learning algorithms for the definition of extrac...